LIMA – Less Is More for Alignment

Large Language Models
Author

Maxime Labonne

Published

July 7, 2023

Tip

LIMA is a 65B parameter LLaMA model fine-tuned (supervised learning) using 1,000 samples, which outperforms DaVinci003.

📝 Paper: https://arxiv.org/abs/2305.11206

Superficial Alignment Hypothesis: alignment is a process where the model learns to leverage knowledge and capabilities that were acquired during pre-training.

Tip

Although I find this hypothesis credible, I do not feel like the paper truly proves it. Proving it would require a much more in-depth analysis based on OOD tasks and information.

The model is fine-tuned on 1,000 examples that approximate user prompts and high-quality responses (750 from forums + 250 manually written examples), optimized for quality and diversity.

Alignment Data

Tip

Training models is easy but building high-quality datasets is hard. This is the most interesting part of the paper to me.

The authors collect data from three popular QA websites:

  • Stack Exchange:
    • Divide the exchanges into 75 STEM exchanges + 99 other (English, cooking, travel, etc.) and discard 5 niche exchanges.
    • Sample 200 QAs from each set (STEM and other, i.e., 400 in total).
    • Within each exchange, take the questions with the highest score (each needs to be self-contained in the title = no body).
    • Select the top answer for each question with score >= 10, 1200 characters < length < 4096 characters, not written in the first person, and without reference to other answers.
    • Links, images, and other HTML tags (except code blocks and lists) are also removed from the answers.
    • They randomly select the title or the description as questions since Stack Exchange has both.
  • wikiHow:
    • Sample 200 articles while ensuring a diversity of categories.
    • Prompt = title (“How to cook an omelette?”) and response = article’s body.
    • Replace “This article…” with “The following answer…”
    • Remove links, images, and certain sections of the text.
  • Pushshift Reddit Dataset:
    • Restricted to two subreddits: r/AskReddit and r/WritingPrompts.
    • Manually select examples from within the most upvoted posts.
    • QAs from r/AskReddit are deemed not necessarily reliable, which is why they are used as a test set.
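
The Stack Exchange answer filter described above can be sketched as a simple predicate. This is only an illustration, not the authors' code: the thresholds come from the notes, but the function name `keep_answer` and the two regular expressions (crude heuristics for "first person" and "reference to other answers") are my own assumptions.

```python
import re

# Thresholds taken from the notes above.
MIN_SCORE = 10
MIN_CHARS = 1200
MAX_CHARS = 4096

# Assumption: a crude heuristic for first-person writing.
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]y)\b")
# Assumption: a crude heuristic for references to other answers.
OTHER_ANSWERS = re.compile(r"as mentioned|other answers?", re.IGNORECASE)


def keep_answer(text: str, score: int) -> bool:
    """Return True if an answer passes all the filters listed in the notes."""
    if score < MIN_SCORE:
        return False
    if not (MIN_CHARS < len(text) < MAX_CHARS):
        return False
    if FIRST_PERSON.search(text):
        return False
    if OTHER_ANSWERS.search(text):
        return False
    return True
```

A real pipeline would also need the HTML cleaning step (removing links and images while keeping code blocks and lists), which is left out here.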

The authors also manually authored examples. They created two groups (A and B), each writing 250 prompts based on personal interests. From Group A, they selected 200 prompts for training and 50 as a held-out development set. The 230 prompts selected from Group B are used for testing only.

They also manually wrote high-quality answers, with a uniform tone and some acknowledgement of the question, followed by the answer. In this data, they include 13 adversarial training prompts (toxic, malevolent) + a rejection in the corresponding answers.

They also add 50 samples from Super-Natural Instructions, which are modified to correspond to the style of the 200 manual examples.
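
As a sanity check, the counts in these notes can be reconciled with the 750 + 250 split mentioned earlier. The sketch below assumes that the 200 Stack Exchange samples are drawn per group (STEM and other). The r/WritingPrompts count is not stated in the notes, so it is derived here as the remainder.

```python
# Totals stated in the notes above.
TOTAL = 1000
FORUM_TOTAL = 750
MANUAL_TOTAL = 250

# Assumption: 200 samples per Stack Exchange group (STEM and other).
stack_exchange = 2 * 200
wikihow = 200
group_a_prompts = 200
super_natural_instructions = 50

# Derived, not stated in the notes: what is left for r/WritingPrompts.
writing_prompts = FORUM_TOTAL - stack_exchange - wikihow

# The manual part and the overall total add up as expected.
assert group_a_prompts + super_natural_instructions == MANUAL_TOTAL
assert FORUM_TOTAL + MANUAL_TOTAL == TOTAL
print(writing_prompts)
```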

Training LIMA

They trained a 65B parameter LLaMA model on 1,000 samples. To distinguish user and assistant, they introduce an end-of-turn token (EOT) at the end of each utterance.
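
To make the EOT idea concrete, here is a minimal sketch of how a dialogue could be serialized. The literal string `<EOT>` and the helper `format_dialogue` are hypothetical: the notes do not specify how the token is spelled or added to the tokenizer.

```python
# Hypothetical placeholder: the actual token string is not given in these notes.
EOT = "<EOT>"


def format_dialogue(turns: list[str]) -> str:
    """Concatenate alternating user/assistant utterances, ending each with EOT."""
    return "".join(f"{utterance}{EOT}" for utterance in turns)


example = format_dialogue(["How to cook an omelette?", "Beat two eggs."])
```

Since every utterance ends with the same token, the speaker is implied by position: odd utterances belong to the user and even ones to the assistant.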

Hyperparameters:

  • 15 epochs using AdamW (\beta_1 = 0.9, \beta_2 = 0.95, and weight decay of 0.1).
  • No warmup steps, initial learning rate = 1e-5, linearly decaying to 1e-6 by the end of training.
  • Batch size of 32 examples (64 for smaller models).
  • Texts longer than 2048 tokens (LLaMA’s context window) are trimmed.
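
The learning rate schedule above is easy to write down explicitly. The sketch below is an illustration under assumptions: the function name `linear_lr` is mine, and the step count assumes the last partial batch of each epoch is kept.

```python
import math

NUM_SAMPLES = 1000
BATCH_SIZE = 32
EPOCHS = 15

# Assumption: the last, partial batch of each epoch is kept.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int, total: int, lr_start: float = 1e-5, lr_end: float = 1e-6) -> float:
    """Linearly interpolate from lr_start (step 0) to lr_end (last step), no warmup."""
    if total <= 1:
        return lr_start
    clamped = min(max(step, 0), total - 1)
    fraction = clamped / (total - 1)
    return lr_start + fraction * (lr_end - lr_start)
```

With so few optimizer steps, skipping warmup means the very first update already uses the full 1e-5 learning rate.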

They also apply dropout over residual connections, starting at p_d = 0.0 at the bottom layer and linearly raising the rate to p_d = 0.3 at the last layer (p_d = 0.2 for smaller models).
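
The per-layer dropout rates can be computed with a simple linear interpolation. This is a sketch, not the authors' implementation: the helper name is hypothetical, and the layer count of 80 for the 65B model is not in these notes (it is the figure I recall for LLaMA 65B).

```python
def residual_dropout_rates(num_layers: int, p_bottom: float = 0.0, p_top: float = 0.3) -> list[float]:
    """Dropout rate per layer, rising linearly from the bottom layer to the last one."""
    if num_layers == 1:
        return [p_top]
    step = (p_top - p_bottom) / (num_layers - 1)
    return [p_bottom + step * i for i in range(num_layers)]


# Assumption: 80 transformer layers for the 65B model.
rates = residual_dropout_rates(80)
```

Each rate would then be applied to the output of the corresponding layer before it is added back to the residual stream.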

They manually selected checkpoints between the 5th and the 10th epochs using the held-out 50 samples (not based on perplexity).

Tip

There are two interesting improvements/modifications over the original model: EOT and residual dropout.

Human Evaluation

Baselines: Alpaca 65B, DaVinci003, Bard, Claude, GPT-4.

Generation parameters: nucleus sampling (p=0.9), temperature of 0.7, repetition penalty of 1.2 on previously generated tokens, max token length = 2048.
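
To show how these decoding parameters interact, here is a small pure-Python sketch of a single sampling step. Two things are assumptions on my part: the order of operations (penalty, then temperature, then nucleus filtering) and the penalty rule (divide positive logits and multiply negative ones, a convention used in common implementations). The notes do not say which variant the authors used.

```python
import math
import random


def sample_next_token(logits, previous_tokens, top_p=0.9, temperature=0.7,
                      repetition_penalty=1.2, rng=random):
    """Sample one token id from raw logits (a list of floats)."""
    # 1. Repetition penalty on tokens that were already generated.
    adjusted = list(logits)
    for tok in set(previous_tokens):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # 2. Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in adjusted]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Nucleus: smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 4. Sample within the nucleus (random.choices renormalizes the weights).
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.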

Multi-Turn Dialogue

The authors created a small multi-turn dialogue dataset with 30 samples (10 manual, 20 based on modified comments from Stack Exchange).

They train a LIMA model on the 1,000 original samples + 30 multi-turn dialogue examples and show the performance of the model greatly improves (from 45.2% to 76.1% excellent responses).